14 research outputs found

    Redundancy, Deduction Schemes, and Minimum-Size Bases for Association Rules

    Full text link
    Association rules are among the most widely employed data analysis methods in the field of Data Mining. An association rule is a form of partial implication between two sets of binary variables. In the most common approach, association rules are parameterized by a lower bound on their confidence, which is the empirical conditional probability of their consequent given the antecedent, and/or by some other parameter bounds such as "support" or deviation from independence. We study here notions of redundancy among association rules from a fundamental perspective. We see each transaction in a dataset as an interpretation (or model) in the propositional logic sense, and consider existing notions of redundancy, that is, of logical entailment, among association rules, of the form "any dataset in which this first rule holds must obey also that second rule, therefore the second is redundant". We discuss several existing alternative definitions of redundancy between association rules and provide new characterizations and relationships among them. We show that the main alternatives we discuss correspond actually to just two variants, which differ in the treatment of full-confidence implications. For each of these two notions of redundancy, we provide a sound and complete deduction calculus, and we show how to construct complete bases (that is, axiomatizations) of absolutely minimum size in terms of the number of rules. We explore finally an approach to redundancy with respect to several association rules, and fully characterize its simplest case of two partial premises.Comment: LMCS accepted pape

    Theoretical bounds on the size of condensed representations

    No full text
    Recent studies demonstrate the usefulness of condensed representations as a semantic compression technique for the frequent itemsets. Especially in inductive databases, condensed representations are a useful tool as an intermediate format to support exploration of the itemset space. In this paper we establish theoretical upper bounds on the maximal size of an itemset in different condensed representations. A central notion in the development of the bounds are the l-free sets, that form the basis of many well-known representations. We will bound the maximal cardinality of an l-free set based on the size of the database. More concrete, we compute a lower bound for the size of the database in terms of the size of the l-free set, and when the database size is smaller than this lower bound, we know that the set cannot be l-free. An efficient method for calculating the exact value of the bound, based on combinatorial identities of partial row sums, is presented. We also present preliminary results on a statistical approximation of the bound and we illustrate the results with some simulations

    Quick inclusion-exclusion

    No full text
    Many data mining algorithms make use of the well-known Inclusion-Exclusion principle. As a consequence, using this principle efficiently is crucial for the success of all these algorithms. Especially in the context of condensed representations, such as NDI, and in computing interesting measures, a quick inclusion-exclusion algorithm can be crucial for the performance. In this paper, we give an overview of several algorithms that depend on the inclusion-exclusion principle and propose an efficient algorithm to use it and evaluate its complexity. The theoretically obtained results are supported by experimental evaluation of the quick IE technique in isolation, and of an example application

    Mining all non-derivable frequent itemsets

    No full text
    Recent studies on frequent itemset mining algorithms resulted in significant performance improvements. However, if the minimal support threshold is set too low, or the data is highly correlated, the number of frequent itemsets itself can be prohibitively large. To overcome this problem, recently several proposals have been made to construct a concise representation of the frequent itemsets, instead of mining all frequent itemsets. The main goal of this paper is to identify redundancies in the set of all frequent itemsets and to exploit these redundancies in order to reduce the result of a mining operation. We present deduction rules to derive tight bounds on the support of candidate itemsets. We show how the deduction rules allow for constructing a minimal representation for all frequent itemsets. We also present connections between our proposal and recent proposals for concise representations and we give the results of experiments on real-life datasets that show the effectiveness of the deduction rules. In fact, the experiments even show that in many cases, first mining the concise representation, and then creating the frequent itemsets from this representation outperforms existing frequent set mining algorithms

    Integrating pattern mining in relational databases

    No full text
    Almost a decade ago, Imielinski and Mannila introduced the notion of Inductive Databases to manage KDD applications just as DBMSs successfully manage business applications. The goal is to follow one of the key DBMS paradigms: building optimizing compilers for ad hoc queries. During the past decade, several researchers proposed extensions to the popular relational query language, SQL, in order to express such mining queries. In this paper, we propose a completely different and new approach, which extends the DBMS itself, not the query language, and integrates the mining algorithms into the database query optimizer. To this end, we introduce virtual mining views, which can be queried as if they were traditional relational tables (or views). Every time the database system accesses one of these virtual mining views, a mining algorithm is triggered to materialize all tuples needed to answer the query. We show how this can be done effectively for the popular association rule and frequent set mining problems

    An inductive database prototype based on virtual mining views

    No full text
    We present a prototype of an inductive database. Our system enables the user to query not only the data stored in the database but also generalizations (e.g. rules or trees) over these data through the use of virtual mining views. The mining views are relational tables that virtually contain the complete output of data mining algorithms executed over a given dataset. The prototype implemented into PostgreSQL currently integrates frequent itemset, association rule and decision tree mining. We illustrate the interactive and iterative capabilities of our system with a description of a complete data mining scenario

    Minimal k-free representations of frequent sets

    No full text
    Due to the potentially immense amount of frequent sets that can be generated from transactional databases, recent studies have demonstrated the need for concise representations of all frequent sets. These studies resulted in several successful algorithms that only generate a lossless subset of the frequent sets. In this paper, we present a unifying framework encapsulating most known concise representations. Because of the deeper understanding of the different proposals thus obtained, we are able to provide new, provably more concise, representations. These theoretical results are supported by several experiments showing the practical applicability

    Efficient pattern mining of uncertain data with sampling

    No full text
    Mining frequent itemsets from transactional datasets is a well known problem with good algorithmic solutions. In the case of uncertain data, however, several new techniques have been proposed. Unfortunately, these proposals often suffer when a lot of items occur with many different probabilities. Here we propose an approach based on sampling by instantiating "possible worlds" of the uncertain data, on which we subsequently run optimized frequent itemset mining algorithms. As such we gain efficiency at a surprisingly low loss in accuracy. These is confirmed by a statistical and an empirical evaluation on real and synthetic data

    Mining itemsets in the presence of missing values

    No full text
    Missing values make up an important and unavoidable problem in data management and analysis. In the context of association rule and frequent itemset mining, however, this issue never received much attention. Nevertheless, the well known measures of support and confidence are misleading when missing values occur in the data, and more suitable definitions typically don't have the crucial monotonicity property of support. In this paper, we overcome this problem and provide an efficient algorithm, XMiner, for mining association rules and frequent itemsets in databases with missing values. XMiner is empirically evaluated, showing a clear gain over a straightforward baseline-algorithm

    Database transposition for constrained (closed) pattern mining

    No full text
    Abstract. Recently, different works proposed a new way to mine patterns in databases with pathological size. For example, experiments in genome biology usually provide databases with thousands of attributes (genes) but only tens of objects (experiments). In this case, mining the “transposed ” database runs through a smaller search space, and the Galois connection allows to infer the closed patterns of the original database. We focus here on constrained pattern mining for those unusual databases and give a theoretical framework for database and constraint transposition. We discuss the properties of constraint transposition and look into classical constraints. We then address the problem of generating the closed patterns of the original database satisfying the constraint, starting from those mined in the “transposed ” database. Finally, we show how to generate all the patterns satisfying the constraint from the closed ones.
    corecore